Problem Statement -

Build your own recommendation system for products on an e-commerce website like Amazon.com. Online e-commerce websites like Amazon and Flipkart use different recommendation models to provide suggestions to different users.
Amazon currently uses item-to-item collaborative filtering, which scales to massive data sets and produces high-quality recommendations in real time. This type of filtering matches each of the user's purchased and rated items to similar items, then combines those similar items into a recommendation list for the user.
In this project we build a recommendation model for the electronics products on Amazon.
The dataset is taken from the website below.
Source - Amazon Reviews data (http://jmcauley.ucsd.edu/data/amazon/). The repository hosts several datasets; for this case study we use the Electronics dataset.


Dataset columns - the first three columns are userId, productId, and ratings; the fourth is timestamp. You can discard the timestamp column, as it is not needed in this case.

In [1]:
import pandas as pd
import numpy as np
from surprise import Reader, Dataset  # 'evaluate' was removed in recent Surprise versions; cross_validate is used instead
from surprise import SVD, KNNBaseline, KNNBasic, NMF, NormalPredictor, BaselineOnly, KNNWithMeans
from surprise.model_selection import cross_validate
from surprise.model_selection import KFold
from collections import defaultdict
from surprise import accuracy

import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
reader = Reader()
dfData = pd.read_csv('ratings_Electronics.csv', names=['userId','productId','ratings','timestamp'])
dfData.head()
Out[2]:
userId productId ratings timestamp
0 AKM1MP6P0OYPR 0132793040 5.0 1365811200
1 A2CX7LUOHB2NDG 0321732944 5.0 1341100800
2 A2NWSAGRHCP8N5 0439886341 1.0 1367193600
3 A2WNBOD3WNDNKT 0439886341 3.0 1374451200
4 A1GI0U4ZRJA8WN 0439886341 1.0 1334707200
In [3]:
dfData.shape
Out[3]:
(7824482, 4)
In [4]:
dfWithoutTimeStamp = dfData.drop(['timestamp'], axis=1)
dfWithoutTimeStamp.head()
Out[4]:
userId productId ratings
0 AKM1MP6P0OYPR 0132793040 5.0
1 A2CX7LUOHB2NDG 0321732944 5.0
2 A2NWSAGRHCP8N5 0439886341 1.0
3 A2WNBOD3WNDNKT 0439886341 3.0
4 A1GI0U4ZRJA8WN 0439886341 1.0

EDA

In [5]:
from plotly.offline import init_notebook_mode, plot, iplot
import plotly.graph_objs as go
init_notebook_mode(connected=True)
In [6]:
data = dfWithoutTimeStamp['ratings'].value_counts().sort_index(ascending=False)
trace = go.Bar(x = data.index,
               text = ['{:.1f} %'.format(val) for val in (data.values / dfWithoutTimeStamp.shape[0] * 100)],
               textposition = 'auto',
               textfont = dict(color = '#000000'),
               y = data.values,
               )
# Create layout
layout = dict(title = 'Distribution Of {} Ratings'.format(dfWithoutTimeStamp.shape[0]),
              xaxis = dict(title = 'Rating'),
              yaxis = dict(title = 'Count'))
# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
Ratings: the dataset is dominated by ratings of 5, followed by 4 and then 1.

Ratings Distribution By Product

In [7]:
data = dfWithoutTimeStamp.groupby('productId')['ratings'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per Product',
                   xaxis = dict(title = 'Number of Ratings Per Product'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
Ratings per Product: most products receive only a handful of ratings (counts are clipped at 50 for plotting); rarely-bought products dominate the distribution.
In [8]:
dfWithoutTimeStamp.groupby('productId')['ratings'].count().reset_index().sort_values('ratings', ascending=False)[:10]
Out[8]:
productId ratings
308398 B0074BW614 18244
429572 B00DR0PDNE 16454
327308 B007WTAJTO 14172
102804 B0019EHU8G 12285
296625 B006GWO5WK 12226
178601 B003ELYQGG 11617
178813 B003ES5ZUU 10276
323013 B007R5YDYA 9907
289775 B00622AG6S 9823
30276 B0002L5R78 9487
Ratings per Product: product 'B0074BW614' has received the most ratings (18,244).

Ratings Distribution By User

In [9]:
data = dfWithoutTimeStamp.groupby('userId')['ratings'].count().clip(upper=50)

# Create trace
trace = go.Histogram(x = data.values,
                     name = 'Ratings',
                     xbins = dict(start = 0,
                                  end = 50,
                                  size = 2))
# Create layout
layout = go.Layout(title = 'Distribution Of Number of Ratings Per User',
                   xaxis = dict(title = 'Ratings Per User'),
                   yaxis = dict(title = 'Count'),
                   bargap = 0.2)

# Create plot
fig = go.Figure(data=[trace], layout=layout)
iplot(fig)
Ratings per User: most users have rated only a few products; the distribution is again heavily long-tailed (clipped at 50).
In [10]:
dfWithoutTimeStamp.groupby('userId')['ratings'].count().reset_index().sort_values('ratings', ascending=False)[:10]
Out[10]:
userId ratings
3263531 A5JLAU2ARJ0BO 520
3512451 ADLVFFE4VBT8 501
2989526 A3OXHLG6DIBRW8 498
3291008 A6FIAB28IS79 431
3284634 A680RUE1FDO8B 406
755206 A1ODOGXEYECQQ8 380
2424036 A36K2N527TXXJN 314
1451394 A2AY4YUOX2N1BQ 311
4100926 AWPODHOB4GFWL 308
1277963 A25C2M3QF9G7OQ 296
Ratings per User: user 'A5JLAU2ARJ0BO' has given the most ratings (520), suggesting a particularly active shopper.

Filter the data to users who have rated more than 50 times. This makes the sparse user-item matrix somewhat denser.

In [11]:
#keep data of users who have rated more than 50 times
min_user_ratings = 50
filter_users = dfWithoutTimeStamp['userId'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

dfFiltered = dfWithoutTimeStamp[dfWithoutTimeStamp['userId'].isin(filter_users)]
print('The original data frame shape:\t{}'.format(dfWithoutTimeStamp.shape))
print('The new data frame shape:\t{}'.format(dfFiltered.shape))
The original data frame shape:	(7824482, 3)
The new data frame shape:	(122171, 3)
Reduce Sparse Matrix: as the plots above show, most users rate only a handful of products. Keeping only users with more than 50 ratings leaves a denser matrix on which a proper recommendation model can be built.

Popularity Recommender model

In [12]:
dfPopularity = dfFiltered.copy()
# dropping duplicate values 
dfPopularity.drop_duplicates(subset=['userId'] ,keep='first',inplace=True)
In [13]:
# users = dfPopularity['userId'].unique()
# len(users) 
In [14]:
# products = dfPopularity['productId'].unique()
# len(products) 

Use train test split to break data before applying the popularity model

In [15]:
from sklearn.model_selection import train_test_split
train_data, test_data = train_test_split(dfPopularity, test_size = 0.30, random_state=0)

Product ID aggregation using count on rating

In [16]:
import Recommender_CountRating
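The `Recommender_CountRating` module is a local file that is not shown in this notebook. Judging from the `create`/`recommend` calls and the output below, it ranks (product, rating) pairs by how many users gave that rating and returns the same top-10 list for any user. A minimal hypothetical sketch of such a class (the real module may differ):

```python
import pandas as pd

class Popularity_Recommender:
    """Hypothetical count-based popularity recommender; a sketch of what
    Recommender_CountRating.Popularity_Recommender might look like."""

    def create(self, train_data, user_id, item_id, rating_id):
        self.user_col = user_id
        # score = number of users who gave each (product, rating) pair
        grouped = (train_data.groupby([item_id, rating_id])[user_id]
                   .count().reset_index(name='score'))
        ranked = grouped.sort_values('score', ascending=False)
        ranked['Rank'] = ranked['score'].rank(ascending=False, method='first')
        # keep only the 10 most-rated (product, rating) pairs
        self.top10 = ranked.head(10)

    def recommend(self, user_id):
        # popularity model: every user gets the same list, tagged with their id
        recs = self.top10.copy()
        recs.insert(0, self.user_col, user_id)
        return recs
```

Because the ranking ignores who is asking, `recommend` returns identical products for every user, which is exactly what a popularity model does.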
In [17]:
#initialize the recommender model and create using the train data
pmCountRating = Recommender_CountRating.Popularity_Recommender()
pmCountRating.create(train_data, 'userId', 'productId','ratings')
In [18]:
#use the popularity model to make some prediction on test data and recommend top 10 items
users = test_data['userId'].unique()
user_id = users[5]
pmCountRating.recommend(user_id)
Out[18]:
userId productId ratings score Rank
490 AWNJAY0M5UI70 B000067RT6 5.0 13 1.0
325 AWNJAY0M5UI70 B00004ZCJE 5.0 12 2.0
130 AWNJAY0M5UI70 B00001P4ZH 4.0 6 3.0
324 AWNJAY0M5UI70 B00004ZCJE 4.0 6 4.0
430 AWNJAY0M5UI70 B00005T3G0 5.0 6 5.0
323 AWNJAY0M5UI70 B00004ZCJE 3.0 5 6.0
149 AWNJAY0M5UI70 B00001WRSJ 5.0 4 7.0
154 AWNJAY0M5UI70 B00001ZWXA 5.0 4 8.0
228 AWNJAY0M5UI70 B00004T8R2 4.0 4 9.0
229 AWNJAY0M5UI70 B00004T8R2 5.0 4 10.0
In [19]:
# create test dataset later used in RMSE calculation
dfCountRatingTest = pmCountRating.recommend(user_id)
In [20]:
# # now test the model
# pmCountRating.create(test_data, 'userId', 'productId','ratings')
# #user the popularity model to make some prediction
# user_id = users[5]
# pmCountRating.recommend(user_id)
In [21]:
# dfCountRatingTest = pmCountRating.recommend(user_id)
In [22]:
#combined rmse value
y_test = dfCountRatingTest.ratings
y_pred = 5  # assume a constant predicted rating of 5 and see how far the test ratings deviate from it
mse = np.mean((y_test - y_pred)**2)
print("Final rmse value is =", np.sqrt(mse))
Final rmse value is = 0.8366600265340756

Product ID aggregation using mean on rating

In [23]:
import Recommender_MeanRating
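The `Recommender_MeanRating` module is likewise not shown. Judging from the output of `recommend(item_id)` below, it returns the individual (user, rating) rows recorded for the requested product, while the mean rating per product drives the overall ranking. A hypothetical sketch consistent with that behavior (the actual module may differ):

```python
import pandas as pd

class Popularity_Recommender:
    """Hypothetical mean-based variant; a sketch only, since the real
    Recommender_MeanRating module is not shown in the notebook."""

    def create(self, train_data, user_id, item_id, rating_id):
        self.train_data = train_data
        self.item_id = item_id
        # mean rating per product, used to rank products by average score
        self.mean_ratings = (train_data.groupby(item_id)[rating_id]
                             .mean().sort_values(ascending=False))

    def recommend(self, item_id):
        # return every (user, rating) row recorded for the given product
        recs = self.train_data[self.train_data[self.item_id] == item_id]
        return recs.sort_index(ascending=False)
```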
In [24]:
# from sklearn.model_selection import train_test_split
# train_data, test_data = train_test_split(dfPopularity, test_size = 0.30, random_state=0)
In [25]:
#initialize the recommender model and create using the train data
pmMeanRating = Recommender_MeanRating.Popularity_Recommender()
pmMeanRating.create(train_data, 'userId', 'productId', 'ratings')
In [26]:
#use the popularity model to make some prediction on test data and recommend top 10 items
products = test_data['productId'].unique()
item_id = products[5]
pmMeanRating.recommend(item_id)
Out[26]:
ratings userId productId
1025 5.0 AZOK5STV85FBJ B00005RI9I
1024 4.0 AZNUHQSHZHSUE B00005RI9I
1023 5.0 AZMY6E8B52L2T B00005RI9I
1022 4.0 AZCE11PSTCH1L B00005RI9I
1021 4.0 AZBXKUH4AIW3X B00005RI9I
1020 4.0 AZ8XSDMIX04VJ B00005RI9I
1019 4.0 AZ515FFZ7I2P7 B00005RI9I
1018 3.0 AYOTEJ617O60K B00005RI9I
1017 5.0 AYOMAHLWRQHUG B00005RI9I
1016 5.0 AYO1146CBIV5C B00005RI9I
In [27]:
dfMeanRatingTest = pmMeanRating.recommend(item_id)
In [28]:
# pmMeanRating.create(test_data, 'userId', 'productId', 'ratings')
# item_id = products[5]
# pmMeanRating.recommend(item_id)
In [29]:
# dfMeanRatingTest = pmMeanRating.recommend(item_id)
In [30]:
#combined rmse value
y_test = dfMeanRatingTest.ratings  # this is the rating column itself, renamed in the Recommender module
y_pred = 5  # assume a constant predicted rating of 5 and see how far the test ratings deviate from it
mse = np.mean((y_test - y_pred)**2)
print("Final rmse value is =", np.sqrt(mse))
Final rmse value is = 0.9486832980505138

Summary Report of Popularity Model

Count-based approach: the RMSE calculated on the test data = 0.83
Mean-based approach: the RMSE calculated on the test data = 0.94; here the same product is kept and ranked by the ratings of different users

Collaborative Filtering model

In [31]:
reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(dfFiltered, reader)
In [32]:
benchmark = []
# Iterate over all algorithms
# SVD/NMF are model based approaches
# non-parametric KNN** are the memory based approaches

for algorithm in [SVD(), NMF(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), BaselineOnly()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0; use pd.concat instead
    tmp = pd.concat([tmp, pd.Series([str(algorithm).split(' ')[0].split('.')[-1]], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Out[32]:
test_rmse fit_time test_time
Algorithm
BaselineOnly 0.979688 0.532867 0.627028
SVD 0.983065 12.041852 0.768447
KNNBaseline 1.043644 0.741310 2.287431
KNNWithMeans 1.064237 0.304685 1.915381
KNNBasic 1.110416 0.233521 1.880884
NMF 1.241539 19.523780 0.636952
NormalPredictor 1.362094 0.280975 0.687958

BaselineOnly performed best, with the lowest test RMSE (~0.98). It predicts r̂_ui = μ + b_u + b_i, i.e. the global mean plus user and item biases.

Train and Predict

In [33]:
print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)
Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Out[33]:
{'test_rmse': array([0.9760419, 0.9732867, 0.9832005]),
 'fit_time': (0.2844209671020508, 0.3431272506713867, 0.33828163146972656),
 'test_time': (0.6979992389678955, 0.6490037441253662, 0.6809372901916504)}
In [34]:
# data.df
In [35]:
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)
Estimating biases using als...
RMSE: 0.9750
Out[35]:
0.9749897527450923

Summary Report of Collaborative Model

Baseline Only: among the models compared, BaselineOnly performed best, with an RMSE of 0.9750 on the held-out test set.

Recommend 10 products

In [36]:
pmCountRating.recommend(user_id)
Out[36]:
userId productId ratings score Rank
490 AWNJAY0M5UI70 B000067RT6 5.0 13 1.0
325 AWNJAY0M5UI70 B00004ZCJE 5.0 12 2.0
130 AWNJAY0M5UI70 B00001P4ZH 4.0 6 3.0
324 AWNJAY0M5UI70 B00004ZCJE 4.0 6 4.0
430 AWNJAY0M5UI70 B00005T3G0 5.0 6 5.0
323 AWNJAY0M5UI70 B00004ZCJE 3.0 5 6.0
149 AWNJAY0M5UI70 B00001WRSJ 5.0 4 7.0
154 AWNJAY0M5UI70 B00001ZWXA 5.0 4 8.0
228 AWNJAY0M5UI70 B00004T8R2 4.0 4 9.0
229 AWNJAY0M5UI70 B00004T8R2 5.0 4 10.0
In [37]:
def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
In [38]:
   
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
In [39]:
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)
best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]
In [40]:
best_predictions
Out[40]:
uid iid rui est details Iu Ui err
6445 A14JBDSWKPKTZA B000N99BBC 5.0 5.0 {'was_impossible': False} 70 118 0.0
3082 A2FXBWR4T4OFQ B000QUUFRW 5.0 5.0 {'was_impossible': False} 58 67 0.0
6131 A1QRST0A3IQIEF B0027AGK3M 5.0 5.0 {'was_impossible': False} 39 12 0.0
11075 A25FL6VLD7S23S B001F51G16 5.0 5.0 {'was_impossible': False} 83 28 0.0
4265 A2VODABWSVHV8E B003WUBIZQ 5.0 5.0 {'was_impossible': False} 81 15 0.0
1071 A2GMZZ6TDYOHY7 B002V8C3W2 5.0 5.0 {'was_impossible': False} 48 31 0.0
1072 A2NYK9KWFMJV4Y B00EKAPZ8S 5.0 5.0 {'was_impossible': False} 173 6 0.0
25540 A2TN0U8173HM7A B000BQ7GW8 5.0 5.0 {'was_impossible': False} 50 32 0.0
6171 A1L64KDYO5BOJA B0019EHU8G 5.0 5.0 {'was_impossible': False} 83 65 0.0
13321 A1Z7SC7HH1BJKA B004LNXO28 5.0 5.0 {'was_impossible': False} 39 9 0.0
Best Predictions: the error is 0, meaning the model exactly reproduced the ratings these users actually gave after buying the item.
In [41]:
dfFiltered.loc[dfFiltered['productId'] == 'B000067RT6']['ratings'].hist()
plt.xlabel('rating')
plt.ylabel('Number of ratings')
plt.title('Number of ratings product B000067RT6 has received')
plt.show();
In [42]:
worst_predictions
Out[42]:
uid iid rui est details Iu Ui err
12311 A2NOW4U7W3F7RI B004O0TREC 1.0 4.848872 {'was_impossible': False} 197 3 3.848872
24280 AUK79PXTAOJP9 B0054MLMLA 1.0 4.854482 {'was_impossible': False} 43 1 3.854482
17005 A35KBAQ4VBNQ6L B00BGGDVOO 1.0 4.855068 {'was_impossible': False} 39 46 3.855068
709 AHF4I1FSIHABC B00007E7JU 1.0 4.861252 {'was_impossible': False} 45 49 3.861252
22613 A3FFZQKCA7UOYY B000GIT002 1.0 4.867678 {'was_impossible': False} 47 7 3.867678
18736 A6KL17KKN0A5L B000JE7GPY 1.0 4.870710 {'was_impossible': False} 36 37 3.870710
21761 A1KY5G5FP31F2F B000LRMS66 1.0 4.877670 {'was_impossible': False} 55 53 3.877670
19052 AHF4I1FSIHABC B006TT91TW 1.0 4.951336 {'was_impossible': False} 45 26 3.951336
20390 A1H55L0BLPCWYF B0002L5R78 1.0 4.989999 {'was_impossible': False} 40 49 3.989999
15003 A16SRDVPBXN69C B000YBH4YU 1.0 5.000000 {'was_impossible': False} 47 6 4.000000
Worst Predictions: here the error values are large (> 3.8). These users may not have actually bought or used the item; they may have rated it based on its popularity or on other users' reviews.
In [43]:
dfFiltered.loc[dfFiltered['productId'] == 'B000YBH4YU']['ratings'].hist()
plt.xlabel('rating')
plt.ylabel('Number of ratings')
plt.title('Number of ratings product B000YBH4YU has received')
plt.show();

Insight

Collaborative filtering is generally superior to popularity-based modelling: identifying item-item and user-user associations and ranking on them yields more personalised recommendations and better accuracy than popularity-based models, which rely solely on aggregate rankings and recommend the same items to everyone. That said, on this dataset the count-based popularity model produced a lower RMSE than both the mean-based and the collaborative models, so the popularity model can still be used to recommend electronics items here.
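To turn the collaborative model's predictions into per-user top-N recommendation lists, the standard helper pattern from the Surprise documentation can be applied to the `predictions` produced above (this helper is not part of the original notebook):

```python
from collections import defaultdict

def get_top_n(predictions, n=10):
    """Map each user id to their n highest-estimated (item, rating) pairs.

    predictions: list of (uid, iid, true_rating, estimated_rating, details)
    tuples, as returned by a Surprise algorithm's test() method.
    """
    top_n = defaultdict(list)
    # group estimated ratings by user
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    # sort each user's items by estimated rating and keep the best n
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n
```

For example, `get_top_n(predictions, n=10)` would give a dict keyed by userId, each value being that user's ten highest-scored products.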